Skip to content

Use compact runtime shape refs internally#4624

Closed
robacourt wants to merge 15 commits into
mainfrom
rob/internal-shape-handles
Closed

Use compact runtime shape refs internally#4624
robacourt wants to merge 15 commits into
mainfrom
rob/internal-shape-handles

Conversation

@robacourt

@robacourt robacourt commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Introduces an in-memory numeric shape_id (the "runtime shape ref") that replaces the binary shape_handle for all internal routing, filtering, and identity. Shapes are held millions of times across ETS tables, maps, and sets; each shape_handle is a ~30-byte binary, so swapping the internal key to a small integer is a substantial memory win — the largest amplifier being the subquery index, where the key was duplicated 4–6× per node per shape.

shape_handle is preserved at every external/durable boundary: the HTTP API, on-disk + storage-ETS keys, the ShapeStatus authority's public face, the client change-notification Registry, and the subquery tag-hash. The shape_id is in-memory only — never persisted, freshly minted each boot.

How to review

The change is deliberately structured so most of it is trivial to review. It breaks into four parts:

1. Mechanical changes (the bulk — ~17 of 23 lib files are 90–100% mechanical)

Pure shape_handleshape_id renames: function params, @specs, ETS tuple keys, message tuples, local vars, and is_shape_handleis_integer guard swaps. No behaviour change. These modules are key swaps only, no algorithm change:
consumer_registry.ex, event_router.ex, filter.ex, filter/indexes/subquery_index.ex, materializer.ex, where_clause.ex, request_batcher.ex, partitions.ex, dependency_layers.ex (algorithm unchanged — only the caller now supplies resolved ids), ref_resolver.ex, state.ex, effects.ex, setup_effects.ex, event_handler/subqueries/{steady,buffering}.ex, dynamic_consumer_supervisor.ex (partition hash now over id — behaviour-equivalent).

2. Core logic — shape_status.ex (the one place to read closely)

This is the feature. ShapeStatus becomes the authority that mints ids and holds the bidirectional map:

  • shape_id_table ETS (id → handle reverse map) + mint_id/1 (atomic :ets.update_counter on a :seq row).
  • id_for_handle/2, handle_for_id/2, fetch_shape_by_id/2, and shape_handle_for_log/2 (returns "unknown, id: N" on miss so log/telemetry callers can resolve unconditionally).
  • The shape_meta_table row grows to a 6-tuple (id appended); ids are minted both in add_shape/2 (new shapes) and populate_shape_meta_table/2 (restore/boot), and the reverse-map row is cleaned up in remove_shape/2.

The id is never written to SQLite — the shape_db/connection/query persistence layer is untouched.

3. Planned boundary (handle ↔ id resolution, as designed)

Resolution happens once at each edge, never on the steady-state hot path:

  • shape_log_collector.ex dependency_ids/2 and event_handler_builder.ex dep_id_for_handle!/2 — resolve a shape's persisted dependency handles → ids at consumer setup.
  • consumer.ex consumer_pid_for_handle/2 — the single handle→id resolver for cold-path/handle-boundary callers; routing-identity functions (name, stop, consumer_pid) are id-only.
  • shape_handle_for_log/2 used at id-only log/telemetry sites.

Deliberately kept on shape_handle: the API/plug boundary, all Storage.* and on-disk paths, the client Elixir Registry pub-sub (register_for_changes/Registry.dispatch), PublicationManager/RelationTracker (one-entry-per-shape, low memory value), and the subquery tag-hash (subquery_tags.ex/move_broadcast.ex are unchanged — their md5("#{stack_id}#{shape_handle}…") is streamed/persisted in :tags headers and must byte-match the Postgres-side hash in querying.ex).

4. Implementation-time additions (the only non-mechanical extras — review these)

Almost all are the unavoidable consequence of introducing a lookup: once a path resolves handle→id, it must handle the lookup missing (a shape concurrently deleted):

  • Deletion-race branches around new lookups: shape_cache.ex restore path (:error -> halt), shape_cleaner.ex (resolve id before ShapeStatus.remove_shape deletes the mapping; nil-guard helpers stop_consumer/3, remove_from_shape_log_collector/2; suspend-path :error -> :ok), consumer.ex all_materializers_alive?.
  • shape_cache.ex start_shape/4 + start_materializers/3 restructured to a with chain: a materializer-id miss now aborts and purges before starting the consumer, instead of starting both regardless (symmetric graceful handling).
  • snapshotter.ex looks the consumer up via ConsumerRegistry.whereis(id) directly so a snapshot-failure report still reaches it after cleanup has removed the handle→id mapping.
  • shape_status.ex put_handle_id/3 — a Mix.env() == :test-gated ETS seeder for routing tests (compile-gated; cannot reach prod).

Guarantees verified

  • shape_id never reaches the wire (no shape_id in api/, plug/, log_items.ex, or :tags/electric-handle headers).
  • id is in-memory only and never persisted (re-minted each boot).
  • Subquery tag-hash stays handle-keyed and byte-compatible with the Postgres side.
  • The two registries stay distinct (custom ConsumerRegistry = id; client Elixir Registry = handle).
  • No internal function accepts both a handle and an id.

Testing

Full suite green: 391 doctests, 9 properties, 1833 tests, 0 failures. Subquery move-in/move-out integration and electric-handle header round-trip covered by existing suites.

@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.46%. Comparing base (ee0da19) to head (80bd892).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4624   +/-   ##
=======================================
  Coverage   59.46%   59.46%           
=======================================
  Files         385      385           
  Lines       43039    43039           
  Branches    12383    12380    -3     
=======================================
+ Hits        25591    25594    +3     
+ Misses      17371    17369    -2     
+ Partials       77       76    -1     
Flag Coverage Δ
packages/agents 72.64% <ø> (ø)
packages/agents-mcp 77.70% <ø> (ø)
packages/agents-mobile 80.67% <ø> (ø)
packages/agents-runtime 83.48% <ø> (+0.01%) ⬆️
packages/agents-server 75.47% <ø> (ø)
packages/agents-server-ui 7.51% <ø> (ø)
packages/electric-ax 51.06% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.83% <ø> (+0.11%) ⬆️
packages/y-electric 56.05% <ø> (ø)
typescript 59.46% <ø> (+<0.01%) ⬆️
unit-tests 59.46% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread .changeset/small-shape-refs.md Outdated
@claude

claude Bot commented Jun 18, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

Since the last review the PR has been completely rewritten (all-new commits, rob/internal-shape-handles) to remove the handle-vs-ref ambiguity the earlier reviews flagged. The internal routing/filtering/identity layer is now uniformly keyed by an in-memory numeric shape_id minted by ShapeStatus, while shape_handle is preserved at every external/durable boundary (HTTP, storage/on-disk, the client Registry, the subquery tag-hash). The earlier mixed-type "optionality" between handle and id is gone — there is now exactly one boundary at each edge. The change is clean, well-commented, and the test suite is reported green (1833 tests, 0 failures); codecov reports all modified lines covered.

What's Working Well

  • The structural concern from the previous round is resolved. Each module is now keyed by a single type: routing identity is id-only (ConsumerRegistry, EventRouter, DependencyLayers, Filter, SubqueryIndex, Materializer, RequestBatcher), and handle/id resolution happens once at each edge via small, clearly-commented helpers (consumer_pid_for_handle/2, dep_id_for_handle!/2, dependency_ids/2, id_for_handle/2). alco's "why the double naming of shape_ref/shape_id?" and "these maps/sets may now have two shapes of values" comments no longer apply — there is no shape_ref and no mixed-value map left.
  • ShapeStatus id lifecycle is correct. mint_id/1 is an atomic :ets.update_counter on a :seq row keyed by an atom (can never collide with the positive-integer ids); add_shape/2 mints and inserts the {id, handle} reverse row; remove_shape/2 resolves the id before deleting the meta row and prunes the reverse row; reset/1 and table (re)creation clear all three tables.
  • The refresh re-mint hazard (the long-open Critical) stays fixed. populate_shape_meta_table/2 reuses the existing id for any handle it already knows (so ids are stable across a refresh) and only mints for genuinely new handles, with an in-code comment explaining why re-minting would orphan the reverse mapping. refresh/1 then sweeps the reverse {id, handle} rows for handles still on the old generation before select_delete removes the stale meta rows — correct ordering, and the new 6-tuple match spec is consistent.
  • Deletion races around the new lookups are handled gracefully and deliberately. start_materializers/3 aborts-and-purges on a missing inner id before starting the consumer; shape_cache.ex restore (start_consumer_for_id / restore_shape_and_dependencies) treats a vanished id as {:error, :no_shape}/{:error, handle}; shape_cleaner.ex resolves the id before remove_shape and nil-guards stop_consumer/3 and remove_from_shape_log_collector/2; the suspend path resolves the still-intact mapping. snapshotter.ex deliberately looks up the consumer via ConsumerRegistry.whereis(id) directly so a snapshot-failure report still lands after the handle/id mapping is gone — and that intent is documented in-code.
  • Observability gap from the previous round is closed. list_shapes_with_ids/1 now Logger.warnings when it drops a shape with no id mapping (the only remaining concurrent-removal window), instead of silently dropping it. Log/telemetry sites that hold only an id use shape_handle_for_log/2, which returns "unknown, id: N" on a miss so callers can log unconditionally.
  • Tests pin the load-bearing invariants (shape_status_test.exs): monotonic-id assignment, shape_handle_for_log fallback after removal, id stability across refresh/1, reverse-mapping prune when a handle is removed elsewhere, and list_shapes_with_ids. The changeset uses robacourt's suggested wording.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

1. Materializer.get_all_as_refs/2 is the one deletion-race site that raises a bare MatchErrormaterializer.ex:95-105

The line {:ok, dep_id} = ShapeStatus.id_for_handle(stack_id, shape_handle) hard-matches {:ok, _}. Every other handle/id resolution either threads :error through gracefully or raises with a descriptive message (dep_id_for_handle!/2 produces "missing shape_id for dependency handle ..."). Here a concurrently-removed dependency would surface as an opaque MatchError. It looks unreachable in practice (guarded by are_deps_filled/1, and I couldn't find a current caller of get_all_as_refs/2), so this is low priority — but for consistency consider reusing the bang-helper (clear message) or dropping the function if it's dead.

2. 45_000 magic timeout still inline in RequestBatcher.add_shape/4request_batcher.ex:64 (carried over, alco's earlier note)

The duplicate GenServer.call concern alco raised is gone (callers go through the defdelegate), but the bare 45_000 remains. A named module attribute (e.g. @add_shape_timeout) with a one-line rationale would document intent. Minor.

Issue Conformance

Still no linked issue. The (now very detailed) PR description matches the implemented scope, including the deliberate boundaries kept on shape_handle (HTTP/plug, all Storage.*/on-disk, the client Registry, PublicationManager/RelationTracker, and the byte-compatible subquery tag-hash). The four guarantees in the description (shape_id never on the wire, never persisted, tag-hash stays handle-keyed, the two registries stay distinct, no function takes both) are consistent with the diff I reviewed. Consider linking a tracking issue for the memory-reduction work.

Previous Review Status

Iteration 9 to 10 (full rewrite):

  • Resolved — structural ambiguity (robacourt's review comment): internal layer is uniformly id-keyed; no mixed handle/id values remain.
  • Resolved — alco's inline comments on request_batcher.ex: the shape_ref/shape_id double-naming and the "two shapes of values" maps are gone (only the 45_000 timeout note lingers, see Suggestion 2).
  • Resolved — silent drop in list_shapes_with_ids/1: now logs a warning.
  • Still standing — refresh id-stability + reverse-map prune: preserved in the rewrite and covered by tests.
  • New (minor): Suggestion 1 (get_all_as_refs/2 hard match).

Nice rewrite — the boundary is now obvious from the code, the deletion-race handling is uniform and documented, and the tests pin the exact invariants the routing layer depends on.


Review iteration: 10 | 2026-06-22

@robacourt robacourt force-pushed the rob/internal-shape-handles branch from c0b57f3 to c82a144 Compare June 18, 2026 11:47

@alco alco left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only reviewed a part of it.

So far, the addition of optionality between shape_ref and shape_handle in multiple modules/functions looks like added complexity we would like to avoid. What's the heuristic I should use when reading the code to know whether a given function can end up taking a shape ref or a shape handle at runtime?

@robacourt robacourt force-pushed the rob/internal-shape-handles branch from 449886b to 6655b70 Compare June 19, 2026 07:53
@robacourt robacourt marked this pull request as draft June 22, 2026 08:18
@robacourt robacourt marked this pull request as ready for review June 22, 2026 11:20
@robacourt robacourt requested a review from magnetised June 22, 2026 11:27
@robacourt

Copy link
Copy Markdown
Contributor Author

I have only reviewed a part of it.

So far, the addition of optionality between shape_ref and shape_handle in multiple modules/functions looks like added complexity we would like to avoid. What's the heuristic I should use when reading the code to know whether a given function can end up taking a shape ref or a shape handle at runtime?

Great points. I've totally re-written it to remove ambiguity and to make the boundary clear.

robacourt and others added 15 commits June 22, 2026 12:34
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Convert the internal routing pipeline from the binary shape_handle to the
numeric shape_id in one atomic change. ConsumerRegistry, EventRouter/Filter,
ShapeLogCollector (+ RequestBatcher/FlushTracker), Partitions and
DependencyLayers are now keyed by id, along with the consumer's registration
name, flush notifications and the removal path.

Client-facing Registry pub-sub, Storage, PublicationManager, the subquery
index seeding/mark_ready and the cleaner's deferred phase remain keyed by
handle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Close the Chunk 3 transitional mixed-key state: the subquery index's
registration side was keyed by shape_id while seeding/membership was still
keyed by shape_handle, leaving every subquery shape permanently in routing
fallback. Flip the remaining index callers (seed_membership, mark_ready,
add_value, remove_value) and the index's own parameters to shape_id so the
exact-membership reader (membership_or_fallback?/member?) matches the
seeded entries again.

Also key the RefResolver and the materializer (process name, link-values
ETS cache, and the {:materializer_changes, dep_id, ...} routing message) by
shape_id, resolving dependency handle -> id at the consumer boundary.

move_broadcast.ex and subquery_tags.ex stay on shape_handle: their tag hash
md5('stack_id' || 'shape_handle' || value) must byte-match the Postgres side
and is streamed/persisted in tag headers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r_log

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@robacourt robacourt force-pushed the rob/internal-shape-handles branch from a1f3712 to 80bd892 Compare June 22, 2026 11:34
@magnetised

Copy link
Copy Markdown
Contributor

@robacourt why have a handle -> id lookup at all. why not just replace shape handles with an integer (system_boot_time (microseconds) + atomic read and increment)? seems like we could lose the external handle vs internal id split and just use an integer everywhere (that's stringified in the url, de-stringified in the request params normalisation). we never use the fact that the handle is composed of the shape hash + timestamp. it's completely opaque so there's no need to keep that handle construction, we could just generate an integer (in a thread safe way). did you consider this? is there something I'm missing?

@robacourt

Copy link
Copy Markdown
Contributor Author

@robacourt why have a handle -> id lookup at all. why not just replace shape handles with an integer (system_boot_time (microseconds) + atomic read and increment)? seems like we could lose the external handle vs internal id split and just use an integer everywhere (that's stringified in the url, de-stringified in the request params normalisation). we never use the fact that the handle is composed of the shape hash + timestamp. it's completely opaque so there's no need to keep that handle construction, we could just generate an integer (in a thread safe way). did you consider this? is there something I'm missing?

I'd prefer this approach, but I'm struggling to find a way of fitting it into an 8 byte number (this PR manages to make it an 8 byte number by keeping it as an internal ID). You can only user 60bits before it starts using 24 bytes (boxed). You could squeeze boot time into 42 bits if it were milliseconds (rather than microseconds) and you used a custom epoch, that would give you ~139 years to play with. Then you'd have 18 bits left so if you went over 262K shapes you'd be into 24 byte numbers again. You also have the problem of two servers started in the same millisecond, but perhaps you could link it to when the database lock is acquired as no two servers can have that at the same time.

24 bytes would still smaller than our current shape handles which work out as 64 bytes once boxed. In reducing the memory footprint of the where clause filter you'd probably end up wanting a 8 byte ref to the 24 byte number which is what this PR was trying to avoid, but maybe that complexity is simpler than the complexity of this PR. What do you think @magnetised ?

@magnetised

Copy link
Copy Markdown
Contributor

@robacourt I don't think a 60 bit integer would limit us to ~300k shapes. I've thought about this a bit here and there and clauded this integer-only design:

Overview

Each ID is a 59-bit non-negative integer composed of a fixed boot prefix and a serial counter. Both prefix components are computed once at server boot; ID generation thereafter is a single atomic increment. Every ID is a single-word Erlang fixnum (no bignum allocation).

Bit layout

  [ 8 bits: boot-year ][ 51 bits: serial counter ]
    └ years since 2026 (0–255)      └ atomic, seeded at boot

Total width 59 bits, within the positive range of a 64-bit BEAM fixnum (max 2⁵⁹−1).

Boot procedure

At startup the server computes and freezes two values:

  1. Year prefix = boot_year − 2026 (8 bits, range 0–255 → 2026–2281).
  2. Counter seed = microseconds elapsed since the start of the boot year, written as the counter's initial value.

The seed spaces independent boots within the same year into different regions of the counter range, preserving rough chronological ordering across boots and reboots.

ID generation

id = (year_prefix << 51) | atomic_increment(counter)

A single atomic fetch-and-add per ID. No clock read after boot.

Behaviour past the boot year

The year field is the boot year, not the current wall-clock year — it is a fixed epoch tag, not a live clock. When a server runs past the calendar year boundary:

  • The prefix stays at its boot value; the counter simply continues climbing in the same bucket.
  • Nothing special happens at the boundary — no rollover, no reseed. IDs remain unique and monotonic.
  • The only ceiling is exhaustion of the 51-bit counter (see capacity below), which takes decades at any realistic rate. A long-running server is unaffected.
  • A reboot adopts the then-current year as its new prefix, moving to a fresh bucket and a fresh seed.

Consequently a server may run indefinitely across many calendar years on a single boot prefix; the prefix distinguishes boot epochs rather than tracking the wall clock.

Capacity

  • Per boot bucket: 2⁵¹ = 2,251,799,813,685,248 (~2.25 quadrillion) IDs.
  • Worst case — booting at the end of a year, seed ≈ a full year of µs (3.15×10¹³): ~2.22 quadrillion IDs still available. The bucket dwarfs a year of microseconds, so even an end-of-year boot consumes ~1.4% of the range.
  • Aggregate across all 256 year prefixes: 2⁵⁹ = 5.76 × 10¹⁷ IDs.

Throughput

  • Bucket-exhaustion ceiling: 2⁵¹ ÷ seconds-per-year ≈ 71.4 million IDs/sec sustained for a full year before a bucket is exhausted (~70× real time).
  • Collision-avoidance ceiling: seeds are microsecond-spaced, so boots one second apart start 1,000,000 apart. A boot stays clear of a later same-year boot's region as long as it averages under 1 million IDs/sec.

Constants

  ┌────────────────────────────────┬───────────────────────────────┐
  │           Parameter            │             Value             │
  ├────────────────────────────────┼───────────────────────────────┤
  │ Epoch                          │ start of 2026                 │
  ├────────────────────────────────┼───────────────────────────────┤
  │ Total ID width                 │ 59 bits (positive fixnum)     │
  ├────────────────────────────────┼───────────────────────────────┤
  │ Year field                     │ 8 bits (2026–2281, 256 years) │
  ├────────────────────────────────┼───────────────────────────────┤
  │ Counter field                  │ 51 bits                       │
  ├────────────────────────────────┼───────────────────────────────┤
  │ IDs per boot bucket            │ 2,251,799,813,685,248         │
  ├────────────────────────────────┼───────────────────────────────┤
  │ Sustained rate (1-year bucket) │ ~71.4 M/sec                   │
  ├────────────────────────────────┼───────────────────────────────┤
  │ Safe rate (collision spacing)  │ ~1 M/sec                      │
  └────────────────────────────────┴───────────────────────────────┘

@robacourt

Copy link
Copy Markdown
Contributor Author

@robacourt I don't think a 60 bit integer would limit us to ~300k shapes. I've thought about this a bit here and there and clauded this integer-only design:

Love this! I'm still trying to work out how we can guarantee there's no collisions though. If we do have a collision, users could see data they're not supposed to see (currently a hash of the shape is included in the handle which prevents this). Collisions sound highly unlikely but there are a few scenarios they could happen:

  1. two servers could start in the same microsecond - but they can't get the lock at the same microsecond, so maybe it could be the time we acquire the lock
  2. two servers could acquire the lock at different times but think it's the same microsecond because their clocks are out of sync (I was thinking we could get the microsecond time from postgres once we have the lock, but there doesn't seem to be a way to do this that's immune to system clock jumps)
  3. The above two scenarios don't have to be the same microsecond, they can be close and the counter from one can increment to the timestamp of the other

@robacourt

Copy link
Copy Markdown
Contributor Author

Closed in favour of #4643

@robacourt robacourt closed this Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants